This course gives an overview of the possibilities of open science, open data science and open research tools. During the course, students analyze and visualize data with statistical methods using open software tools.
Click here for my Github repository
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
dim(lrn14)
## [1] 166 7
The data was collected in the Introduction to Social Statistics course in fall 2014. The sample size is N = 166. The survey was originally conducted in Finnish. The data contains the background characteristics age, gender and course exam points of the students who answered the questionnaire. Students answered questions about their learning approaches and their attitude towards statistics on a Likert scale (1 to 5). Observations where the exam points variable is zero are excluded.
The variable attitude is the mean of the questions measuring a student's attitude towards statistics. The variables stra, deep and surf are means of sets of questions related to learning strategies. deep measures how deeply a student tries to understand statistics. stra measures how a student organizes studying and manages time. surf measures surface learning, i.e. whether a student just memorizes things without context, or does not find a purpose in studying the topic.
Some background characteristics are omitted from the data. The data does not contain many missing values. The meta text can be found here.
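The wrangling that produced these combination variables is not shown here; a minimal sketch of the idea, using a toy data frame with made-up item names (q1…q4, not the actual survey columns), could look like this:

```r
# toy data frame; the item names q1..q4 are made up for illustration
raw <- data.frame(
  q1 = c(5, 3, 4), q2 = c(4, 2, 5),   # items of one dimension (e.g. deep)
  q3 = c(2, 1, 3), q4 = c(1, 2, 2),   # items of another (e.g. surf)
  points = c(22, 0, 28)               # exam points; zeros will be dropped
)
raw$deep <- rowMeans(raw[, c("q1", "q2")])  # mean of the Likert items
raw$surf <- rowMeans(raw[, c("q3", "q4")])
clean <- subset(raw, points > 0)            # exclude zero-point observations
nrow(clean)
```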
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(col=gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The data was collected to study how different variables affect exam points (a proxy for learning). Points and attitude have the highest correlation in the matrix above. In absolute value, points correlates more with surf and stra than with age or deep. Attitude correlates strongly with points, indicating that it may be an important factor for exam performance. However, a high correlation does not imply a causal relationship between the variables.
Age does not correlate strongly with any of the variables in the sample. Attitude also correlates clearly less with the other variables than with points (0.437 for points, less than 0.2 in absolute value for the others).
The data contains a larger portion of females than males. On average, males have a higher attitude value than females, and lower values for surf (surface learning). For the other variables, the distributions are quite similar for men and women. The males who participated in the course may be more motivated to learn statistics than the females.
# create a plot matrix with ggpairs()
ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))
If exam points is the target (dependent) variable, we should choose the explanatory variables with the highest correlation to exam points in absolute value. Points correlates most strongly (in absolute value) with attitude (0.437), stra (0.146) and surf (-0.144).
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = lrn14)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The coefficient of stra is 0.8531, indicating a positive relationship between good learning strategies and exam points. surf has a coefficient of -0.5861, indicating that surface learning could have a negative effect on exam scores. However, stra and surf are not statistically significant even at the 10 % significance level (p-values 0.11716 and 0.46563). Thus, I remove these variables from the model. I also plot the simple model after removing them.
# a scatter plot of points versus attitude
qplot(attitude, points, col = gender, data = lrn14) + geom_smooth(method = "lm")
From the plot we can clearly see that attitude towards statistics is positively correlated with exam points. There may be a few outliers in the bottom-right corner of the picture (a few people with low exam points but a positive attitude towards learning statistics).
# fit a linear model
my_model <- lm(points ~ attitude, data = lrn14)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The explanatory variable attitude is statistically significant at the 1 % significance level (p-value 4.12e-09).
The coefficient of attitude is 3.5255 (from the summary above). It means that if the value of attitude increases by 1 (a student has a more positive attitude towards studying statistics), her exam score is expected to be 3.5255 points higher. Attitude has a positive statistical relationship with the dependent variable (exam score).
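As a quick sanity check of this interpretation, predictions can be computed by hand from the reported estimates (11.6372 and 3.5255):

```r
# hand-computed predictions from the reported estimates
b0 <- 11.6372               # intercept
b1 <- 3.5255                # coefficient of attitude
pred <- function(attitude) b0 + b1 * attitude

pred(3)                     # expected exam points at attitude 3
pred(4) - pred(3)           # a one-unit increase adds exactly b1 points
```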
Multiple R-squared is 0.1906 and adjusted R-squared is 0.1856. Both values are low, so the model does not account for much of the variance in the data. However, low R-squared values alone do not indicate that the model is inadequate, and such values are common in models that attempt to predict human behavior. The statistically significant positive relationship between the explanatory variable (attitude) and the dependent variable (exam points) is the main finding here. The adjusted R-squared is almost the same as in the previous model with three variables, indicating that the fit remains essentially the same.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
To use a linear regression model, one should check that the error term is normally distributed with a constant variance and an expected value of zero. The error terms should not be correlated, and the error term should not depend on the explanatory variables.
In the first plot (Residuals vs Fitted), the points are spread reasonably randomly. This supports the assumption that the size of the errors does not depend on the explanatory variables (constant variance assumption). The second plot (Normal Q-Q) shows a reasonable fit, since the points fall close to the line. Hence, it is safe to assume that the errors of the model are normally distributed. In the third plot (Residuals vs Leverage), no single observation stands out, so there are no outliers that should be removed. In conclusion, the assumptions of the linear regression model hold, and the model seems adequate.
library(tidyr); library(dplyr); library(ggplot2); library(GGally)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep= ",", header= TRUE)
Take a look at the data:
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 382 35
The data contains 35 variables and 382 observations. It measures secondary education achievement and alcohol consumption in two Portuguese schools. The data combines two questionnaires: one concerning performance in a mathematics course and another concerning performance in a Portuguese language course. The observations were matched using background variables common to both datasets, in order to study the relationship between educational achievement and alcohol consumption.
The meta text can be found here.
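The combination itself was done in a separate wrangling script (not shown here). In outline it is an inner join on background variables shared by the two questionnaires; the sketch below uses toy data frames and only three of the identifying columns, so it illustrates the idea rather than the actual wrangling:

```r
# toy stand-ins for the two questionnaires; the real data uses many more
# identifying background columns than the three shown here
math <- data.frame(school = c("GP", "GP", "MS"), sex = c("F", "M", "F"),
                   age = c(16, 17, 15), G3_math = c(10, 12, 14))
por  <- data.frame(school = c("GP", "MS", "GP"), sex = c("F", "F", "M"),
                   age = c(16, 15, 17), G3_por = c(11, 13, 15))

# keep only students that appear in both datasets (base-R inner join)
joined <- merge(math, por, by = c("school", "sex", "age"))
nrow(joined)
```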
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
I chose a list of variables that I want to consider.
cor(alc$high_use, as.numeric(alc$sex))
## [1] 0.1962526
Sex is probably correlated with alcohol consumption; I would assume that males drink more than females. The correlation between high use and sex is 0.20.
cor(alc$high_use, alc$goout)
## [1] 0.352999
Going out may also increase the probability of drinking, and it is fairly strongly correlated with high use (0.35). I would guess that there is a causal relationship.
However, I should not use failures together with absences, since the two are not independent (absences may cause failures). I will also drop health, since bad health may itself be caused by drinking.
cor(alc$high_use, alc$G3)
## [1] -0.03487925
cor(alc$high_use, alc$absences)
## [1] 0.2229386
So I end up with the list:

- absences (positive correlation 0.22 with high use)
- sex (positive correlation 0.20, higher use for males)
- going out (positive correlation 0.35 with high use)
- G3 (negative correlation -0.03 with high use)
g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
g1 + geom_bar() + ggtitle("Alcohol use by sex")
Higher alcohol use is clearly more common for males.
g2 <- ggplot(alc, aes(goout, fill = sex))
# draw a bar plot of high_use by sex
g2 + facet_wrap("sex") + geom_bar()
There is no big difference in going out between sexes. It seems there are more males than females who go out often, but on average the difference is small.
# initialize a plot of high_use and goout
g3 <- ggplot(alc, aes(x = high_use, y = goout, col = sex))
# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("going out") + ggtitle("Going out by alcohol consumption and sex")
Going out clearly correlates positively with high use.
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") + ggtitle("Grades by alcohol consumption and sex")
For males, high use of alcohol seems to have a negative effect on the grade, but for females the effect is very small.
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
High use of alcohol is clearly correlated with absences.
The boxplots support all the hypotheses I made previously about the relationships (correlations). However, it is not possible to draw conclusions about causality from these descriptive plots.
m <- glm(high_use ~ famrel + goout + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + goout + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8278 -0.7554 -0.4827 0.7282 2.5976
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.69810 0.65848 -4.097 4.18e-05 ***
## famrel -0.44809 0.14052 -3.189 0.00143 **
## goout 0.79976 0.12510 6.393 1.63e-10 ***
## absences 0.06331 0.01627 3.891 9.96e-05 ***
## sexM 1.06742 0.26378 4.047 5.20e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 371.08 on 377 degrees of freedom
## AIC: 381.08
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) famrel goout absences sexM
## -2.69810044 -0.44808654 0.79975625 0.06330634 1.06742473
To find the most parsimonious model, I follow the approach of Vehkalahti, Kimmo & Everitt, Brian S. (2019, not yet published, see course materials: Multivariate Analysis for the Behavioral Sciences, 2nd Edition. Chapman and Hall/CRC, Boca Raton, Florida): I remove one variable at a time from the model and select the model with the lowest AIC.
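R's built-in step() function automates this kind of backward elimination by AIC. A sketch on simulated data (not on alc), just to show the mechanics:

```r
# backward elimination by AIC with step(); simulated data, not the alc dataset
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- rbinom(200, 1, plogis(1.5 * d$x1))     # only x1 truly matters

m_full <- glm(y ~ x1 + x2 + x3, data = d, family = "binomial")
m_best <- step(m_full, direction = "backward", trace = 0)
formula(m_best)   # the informative predictor x1 survives the elimination
```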
First, I remove sex:
m <- glm(high_use ~ famrel + goout + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8808 -0.7599 -0.5297 0.8415 2.5030
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.28530 0.62653 -3.648 0.000265 ***
## famrel -0.40001 0.13505 -2.962 0.003057 **
## goout 0.79744 0.12153 6.562 5.33e-11 ***
## absences 0.05691 0.01614 3.526 0.000422 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 388.31 on 378 degrees of freedom
## AIC: 396.31
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) famrel goout absences
## -2.28529556 -0.40001032 0.79744487 0.05691436
Second, I remove absences:
m <- glm(high_use ~ famrel + goout + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + goout + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5782 -0.7773 -0.5086 0.8240 2.5739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2055 0.6284 -3.510 0.000448 ***
## famrel -0.4689 0.1381 -3.397 0.000682 ***
## goout 0.8057 0.1217 6.620 3.6e-11 ***
## sexM 0.9588 0.2541 3.773 0.000162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 387.54 on 378 degrees of freedom
## AIC: 395.54
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) famrel goout sexM
## -2.2054721 -0.4689221 0.8056623 0.9587523
Third, I remove goout:
m <- glm(high_use ~ famrel + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6402 -0.8524 -0.6013 1.0729 2.1499
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.59731 0.51994 -1.149 0.2506
## famrel -0.32188 0.12600 -2.555 0.0106 *
## absences 0.07119 0.01773 4.014 5.96e-05 ***
## sexM 1.05932 0.24610 4.304 1.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 419.25 on 378 degrees of freedom
## AIC: 427.25
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) famrel absences sexM
## -0.59731225 -0.32187856 0.07118883 1.05931782
Fourth, I remove family relationships:
m <- glm(high_use ~ goout + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8180 -0.7872 -0.5325 0.7930 2.4627
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.26004 0.48359 -8.809 < 2e-16 ***
## goout 0.74880 0.12058 6.210 5.30e-10 ***
## absences 0.06601 0.01694 3.897 9.73e-05 ***
## sexM 1.00074 0.25778 3.882 0.000104 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 381.44 on 378 degrees of freedom
## AIC: 389.44
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) goout absences sexM
## -4.26004047 0.74880103 0.06601437 1.00073889
The AICs are:

- full model: 381.08
- without sex: 396.31
- without absences: 395.54
- without goout: 427.25
- without famrel: 389.44
The original logistic regression model should be selected, since it has the lowest AIC. Thus, there is no need to remove variables from the model. This should be the most parsimonious model that describes the data adequately.
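The reported AIC values can be verified by hand: for a GLM, AIC = residual deviance + 2 × (number of estimated parameters). For the full model:

```r
# AIC = residual deviance + 2 * number of estimated parameters
resid_deviance <- 371.08   # from the full model summary
n_params <- 5              # intercept + famrel + goout + absences + sexM
aic <- resid_deviance + 2 * n_params
aic                        # matches the reported AIC of 381.08
```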
# find the model with glm()
m <- glm(high_use ~ famrel + goout + absences + sex, data = alc, family = "binomial")
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.0673333 0.0177991 0.2371555
## famrel 0.6388494 0.4830454 0.8395914
## goout 2.2249985 1.7533774 2.8667377
## absences 1.0653532 1.0335056 1.1030990
## sexM 2.9078813 1.7469330 4.9254515
The odds ratios of going out, absences and sex (male) are above one, which implies that these variables are positively associated with high use of alcohol. On the other hand, good family relationships are negatively associated with high use of alcohol.
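The odds ratios are simply the exponentiated coefficients from the summary above, so they can be checked by hand, e.g. for goout:

```r
# odds ratio for goout from its log-odds coefficient
b_goout <- 0.79975625
or_goout <- exp(b_goout)
or_goout   # ~2.22: one extra unit of goout multiplies the odds of high use by about 2.2
```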
# fit the model
m <- glm(high_use ~ famrel + goout + absences + sex, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 254 16
## TRUE 59 53
summary(alc$high_use)
## Mode FALSE TRUE
## logical 270 112
According to the table, the model correctly predicts non-high alcohol consumption for 254 individuals out of the true 270. The model clearly has a weakness in predicting high alcohol consumption: it identifies only 53 of the 112 individuals whose consumption is truly high.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
The plot visualizes the previous finding: the model is good at predicting individuals with low alcohol consumption, but performs worse at predicting high alcohol consumption.
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66492147 0.04188482 0.70680628
## TRUE 0.15445026 0.13874346 0.29319372
## Sum 0.81937173 0.18062827 1.00000000
In total, the model predicts 75 individuals wrong (roughly 20 % of the sample). The table above is consistent with the table shown before the plot, but now the same information is given as proportions. According to the predictions, 81.9 % of individuals are not high users of alcohol and 18.1 % are high users. In the sample, 70.7 % are not high users of alcohol and 29.3 % are. There is thus a roughly 11 percentage point difference between the predicted and the actual share of high alcohol consumers.
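The same 2 × 2 table can be condensed into standard classification metrics; the counts below are taken from the first cross tabulation above:

```r
# counts from the cross tabulation (high_use vs. prediction)
tn <- 254; fp <- 16   # true high_use = FALSE
fn <- 59;  tp <- 53   # true high_use = TRUE
n  <- tn + fp + fn + tp

accuracy    <- (tn + tp) / n   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)  # share of true high users that are detected
specificity <- tn / (tn + fp)  # share of non-high users that are detected
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```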
The model performs much better than guessing when predicting low alcohol consumption. However, for individuals with high alcohol consumption the model performs poorly, and it is only slightly better than guessing.
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.1963351
# when prob = 0: how often the prediction is wrong
# when prob = 1: how often the prediction is right
# when prob = alc$probability
The training error is 19.6 %, which implies that the model predicts roughly 20 % of the true values wrong.
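To see what the loss function actually measures, it can be run on small hand-made vectors:

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5   # TRUE when prob is on the wrong side of 0.5
  mean(n_wrong)                        # proportion of wrong predictions
}

# class 1 with prob 0.9 and class 0 with prob 0.2 are on the right side;
# class 1 with prob 0.3 is not -> one error out of three
loss_func(class = c(1, 0, 1), prob = c(0.9, 0.2, 0.3))
```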
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.1963351
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2198953
The average number of wrong predictions in the training data is 19.6 %. The average prediction error in the 10-fold cross-validation is 0.22, which is clearly lower than in the model introduced in DataCamp (with a 0.26 error).
I create a model with all variables in the dataset (excluding alc_use and high_use).
m <- glm(high_use ~ school + sex + age + address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + nursery + internet + guardian + traveltime + studytime + failures + schoolsup +
famsup + paid + activities + higher + romantic + famrel + freetime+ goout+ Dalc+ Walc + health + absences + G1 + G2 + G3 , data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 270 0
## TRUE 0 112
summary(alc$high_use)
## Mode FALSE TRUE
## logical 270 112
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
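The perfect fit is expected rather than impressive: high_use is derived from the weekday and weekend consumption variables (Dalc and Walc), which are both included as predictors, so the outcome is a deterministic function of the inputs. A toy example of the same phenomenon, where the response is defined directly from a predictor:

```r
# response defined directly from the predictor -> perfect separation
x <- c(1, 1, 2, 3, 4, 4)
y <- x > 2                              # y is fully determined by x
m_toy <- suppressWarnings(
  glm(y ~ x, family = "binomial")       # warns: fitted probabilities 0 or 1
)
p <- predict(m_toy, type = "response")
all((p > 0.5) == y)                     # every observation classified correctly
```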
With this model there are no errors in the predictions and the training error is zero, since the model fits the training data perfectly. Next, let's extend the previous four-variable model by adding age, studytime, Medu, Fedu, Mjob, Fjob and the grades G1, G2 and G3:
m <- glm(high_use ~ famrel +age + goout + absences + sex +G1 + G2 + G3 +Medu + studytime + Fedu + Mjob + Fjob , data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 22
## TRUE 57 55
summary(alc$high_use)
## Mode FALSE TRUE
## logical 270 112
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2068063
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
The training error is 0.2068063, and the number of wrong predictions clearly increases.
Remove G1, G2, G3:
m <- glm(high_use ~ famrel +age + goout + absences + sex +Medu + studytime + Fedu + Mjob + Fjob , data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 246 24
## TRUE 57 55
summary(alc$high_use)
## Mode FALSE TRUE
## logical 270 112
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2120419
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
Removing these variables increased by 2 the number of individuals predicted to be high users who actually are not. The model performs slightly worse than the model with 13 explanatory variables, and the training error increases slightly.
Next, remove Medu, Fedu, Mjob and Fjob.
m <- glm(high_use ~ famrel +age + goout + absences + sex + studytime , data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 18
## TRUE 65 47
summary(alc$high_use)
## Mode FALSE TRUE
## logical 270 112
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2172775
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
Now the model is slightly better at predicting low use, but worse at predicting high use. The training error increases by a very small amount.
m <- glm(high_use ~ famrel + absences , data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 259 11
## TRUE 100 12
summary(alc$high_use)
## Mode FALSE TRUE
## logical 270 112
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2905759
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
The training error increases, and the model clearly yields worse predictions. It seems that removing a couple of variables has only a small effect on prediction accuracy, unless the number of explanatory variables becomes very low.
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
library(GGally)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, plotly, stats
## lag(): dplyr, stats
## select(): dplyr, plotly, MASS
data('Boston')
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The dataset contains housing values in suburbs of Boston, along with other variables that describe the area, e.g. crime rate, accessibility to radial highways and air quality. The data contains 506 observations and 14 variables. All variables are numeric, and two of them are integers.
The meta text can be found here.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The crime rate seems to vary a lot between areas. Some areas have a low proportion of non-retail business acres per town, so those towns have separated residential areas. Most of the housing is not located by the Charles River. The nitrogen oxides concentration roughly doubles between the minimum and maximum, indicating that air pollution is concentrated in certain areas.
Segregation may be an issue here, since in some areas few black people live. However, the difference between the first and third quartiles is small, so there are only very few areas with a small black population.
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# print the correlation matrix
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
The weighted mean of distances to five Boston employment centres is highly negatively correlated with the proportion of owner-occupied units built prior to 1940, nitrogen oxides concentration and the proportion of non-retail business acres per town. Old neighbourhoods are typically residential areas. The air contains more nitrogen oxides in areas with lots of traffic, such as the active employment centres. Areas with non-retail businesses probably offer fewer opportunities for employment, and these areas are more likely to be residential.
Median value of owner-occupied homes has a high negative correlation with percentage of lower status people of the population.
Index of accessibility to radial highways is highly positively correlated with full-value property-tax rate, indicating that properties close to highways are more expensive (higher value of the property equals higher tax).
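Instead of reading the strongest relationships off the matrix by eye, they can be listed programmatically. A minimal sketch (keeping only the upper triangle so each pair appears once):

```r
library(MASS)  # for the Boston data
data("Boston")

# correlation matrix with the diagonal and lower triangle masked
cor_matrix <- cor(Boston)
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA

# flatten to variable pairs and sort by absolute correlation
pairs <- as.data.frame(as.table(cor_matrix))
pairs <- pairs[!is.na(pairs$Freq), ]
pairs <- pairs[order(-abs(pairs$Freq)), ]
head(pairs, 3)  # rad-tax (0.91) and nox-dis (-0.77) top the list
```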
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Now the mean of each variable is scaled to zero. It is now easier to see, e.g. from the variable crim, that high crime rates are really concentrated in a small number of areas, since the minimum value and the first quartile are close to the mean but the maximum value is very high. Segregation may be a problem in a couple of areas, since the minimum of the variable black is quite far from the mean, indicating that there are some residential areas with very few black people. Some areas are quite far from the employment centres (max of dis is 3.9566).
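As a reminder of what scale() does under the hood: each column is standardized by subtracting its mean and dividing by its standard deviation. A quick check on a toy vector:

```r
# scale() standardizes a column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
scaled_x <- as.vector(scale(x))
manual   <- (x - mean(x)) / sd(x)

all.equal(scaled_x, manual)   # TRUE
round(mean(scaled_x), 10)     # 0
sd(scaled_x)                  # 1
```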
# From matrix to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Now the crime rate is split into four classes based on its quartiles. I split the data into a training set (80 % of the data) and a test set (20 %) and removed the original crime rate variable, as instructed in exercise 6.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2425743 0.2450495 0.2698020
##
## Group means:
## zn indus chas nox rm
## low 1.08083795 -0.8943943 -0.11163110 -0.8800110 0.470091268
## med_low -0.09084544 -0.2974862 0.04906687 -0.5663270 -0.126221526
## med_high -0.39022525 0.1826612 0.16512651 0.3813835 0.008125402
## high -0.48724019 1.0169738 -0.01948777 1.0407399 -0.388118352
## age dis rad tax ptratio
## low -0.9338618 0.9077873 -0.6959263 -0.7592490 -0.50263928
## med_low -0.3559213 0.3253729 -0.5588134 -0.5204595 -0.05440252
## med_high 0.4435240 -0.3462913 -0.4389595 -0.3198335 -0.19175058
## high 0.8049719 -0.8451343 1.6395837 1.5150965 0.78247128
## black lstat medv
## low 0.3715361 -0.78036789 0.576251686
## med_low 0.3155938 -0.12938995 0.001647569
## med_high 0.1318285 0.01316435 0.096266769
## high -0.8367362 0.83659614 -0.658639601
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.164389572 0.830481133 -0.92716146
## indus 0.016212860 -0.098096658 0.21570395
## chas -0.020152805 -0.002890718 0.20181786
## nox 0.358022060 -0.624629773 -1.45044357
## rm -0.001217507 -0.034059074 -0.09351584
## age 0.275071835 -0.427493886 -0.33109216
## dis -0.118260639 -0.316701799 0.02399022
## rad 3.327591737 1.201305025 0.07126450
## tax 0.072289762 -0.401866808 0.47066079
## ptratio 0.146361446 -0.006009057 -0.28893602
## black -0.106148344 0.009681031 0.09780773
## lstat 0.124165435 -0.177824080 0.44044775
## medv 0.052393518 -0.363980200 -0.28121416
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9525 0.0355 0.0120
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The plot shows two clusters that look individually linear, although combined they would form a quite different pattern. High crime rates are concentrated in one cluster and low rates in the other.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 15 1 0
## med_low 3 14 11 0
## med_high 0 9 16 2
## high 0 0 0 18
The model predicts a high crime rate very well: all 18 high observations are classified correctly, and it only predicts high twice when the true value is medium high. With a true value of medium high, the model succeeds 16 times out of 27 and predicts medium low 9 times.
With a true value of medium low, the model predicts correctly only half of the time (14 of 28). With a low crime rate, it predicts correctly roughly 45 % of the true values (13 of 29). However, the failed predictions usually still indicate a low or medium low crime rate. Thus, the model is good at predicting the two highest crime rate classes but struggles with the two lowest ones.
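The per-class accuracies can also be computed directly from the cross tabulation. The counts below are the ones printed above (they vary between runs, since the train/test split is random):

```r
# confusion matrix from the cross tabulation above
conf <- matrix(c(13, 15,  1,  0,
                  3, 14, 11,  0,
                  0,  9, 16,  2,
                  0,  0,  0, 18),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# share of correct predictions within each true class
round(diag(conf) / rowSums(conf), 2)  # 0.45 0.50 0.59 1.00
# overall accuracy: 61 / 102, about 0.60
sum(diag(conf)) / sum(conf)
```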
Next, load the Boston data again and standardize it:
data('Boston')
# euclidean distance matrix
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Calculate the distances between observations using the Euclidean and Manhattan distance metrics:
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Euclidean distance is a good choice of distance measure here, since it is commonly used for standardized data.
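The two metrics combine coordinate differences differently: Euclidean takes the square root of the summed squares, while Manhattan sums the absolute differences. A two-point check:

```r
# distance between (0, 0) and (3, 4)
m <- rbind(a = c(0, 0), b = c(3, 4))

dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # |3| + |4| = 7
```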
boston_scaled <- as.data.frame(boston_scaled)
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled[1:7], col= km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
pairs(boston_scaled[10:14], col = km$cluster)
I compared the plots with different numbers of clusters. Two or three clusters seemed reasonable, and I ended up using two, since the clusters are then very distinct in the plots.
The crime rate very often splits into the two clusters; see, for example, the first plot matrix with the proportion of residential land zoned for large lots and the proportion of non-retail business acres per town. Crimes are centered in areas with a low proportion of residential land zoned for large lots. Areas with a high proportion of non-retail business acres show high variation in crime rates, but overall these areas have a higher crime rate.
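The visual comparison can be backed up numerically with the total within-cluster sum of squares (TWCSS): a reasonable number of clusters is where the TWCSS curve drops sharply (the "elbow"). A sketch, assuming the scaled Boston data:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# TWCSS for k = 1..10 clusters; fix the seed since k-means
# starts from random centers
set.seed(123)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)

plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "TWCSS")
```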
Super bonus:
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
Here the color is defined by the crime classes.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= km$cluster[1:404])
The second plot doesn’t look as I expected it to. I thought there would be two clusters, as how one can see by eye. Now the clustering seems very random. From the first plot we can see, that high crime rates are clearly clustered away from others, indicating that areas with high crime rates have different charasteristics from other areas.
library(ggplot2)
library(GGally)
library(dplyr)
library(corrplot)
human <- read.table("human.txt", header = TRUE)
summary(human)
## se.ratio LFPR.ratio Expected.Education Life.Expectancy
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI.cap MMR ABR PR
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ se.ratio : num 1.007 0.997 0.983 0.989 0.969 ...
## $ LFPR.ratio : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Expected.Education: num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Expectancy : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI.cap : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ MMR : int 4 6 6 5 6 7 9 28 11 8 ...
## $ ABR : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ PR : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
The data contains eight variables and 155 rows.
The dataset combines data from the Human Development Index and the Gender Inequality Index, both from the United Nations Development Programme's Human Development Reports.
The variables correspond to:
- se.ratio: ratio of females to males with at least secondary education
- LFPR.ratio: ratio of females to males in the labor force
- Expected.Education: expected years of education
- Life.Expectancy: life expectancy at birth
- GNI.cap: gross national income per capita
- MMR: maternal mortality ratio
- ABR: adolescent birth ratio
- PR: percent representation of females in parliament
See the datapage for further information here.
Technical notes can be found here.
ggpairs(human)
cor(human) %>% corrplot
Maternal mortality ratio is positively correlated with adolescent birth ratio, indicating that young mothers face risks when giving birth. Expected education and maternal mortality ratio have a high negative correlation, while expected education and life expectancy are highly positively correlated. Expected education and adolescent birth ratio are highly negatively correlated. Life expectancy is negatively correlated with both maternal mortality ratio and adolescent birth ratio. It would be important to study whether education has a negative causal effect on the adolescent birth ratio.
# Mean is again zero. Check how standardization affects the variance
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
If the data is not standardized, a single principal component alone explains nearly all of the variation in the data. However, this graph remains uninformative and not useful for interpretation.
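The uninformative plot follows from PCA being variance-based: GNI.cap is measured in dollars, so its variance dwarfs that of the other variables and the first component simply points along it. A toy illustration of the effect:

```r
# one variable on a much larger scale dominates unstandardized PCA
set.seed(1)
toy <- data.frame(small = rnorm(100, sd = 1),
                  large = rnorm(100, sd = 1000))

pca_raw <- prcomp(toy)
summary(pca_raw)$importance[2, 1]  # PC1 explains ~100 % of the variance

pca_std <- prcomp(scale(toy))
summary(pca_std)$importance[2, 1]  # after standardizing, PC1 ~50 %
```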
human_std <- scale(human)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub = "The dimensionality of human development reduced to two principal components (PC). ")
First PC captures more than 50% of the variance of the original eight variables.
The ratio of females to males in the labor force is almost orthogonal to all variables except the percentage of representation in parliament, indicating that it has a very low correlation with those variables. The same applies to the percentage of representation in parliament.
Maternal mortality ratio is positively correlated with adolescent birth ratio, since their arrows in the plot point in the same direction. These two variables are highly negatively correlated with expected years of education, gross national income per capita, the ratio of females to males with at least secondary education, and life expectancy at birth. These last four variables are positively correlated with each other.
All arrows in the plot are quite long, indicating that each variable has a non-zero standard deviation. Percent representation in parliament and the ratio of females to males in the labor force contribute to the second principal axis (PC2), since their arrows point closely in the direction of the PC2 axis. The rest of the variables contribute to the first principal axis (PC1). The PC2 axis can therefore be interpreted as how much females participate in parliamentary representation and in the labor force. The PC1 axis is trickier to interpret, since more variables are related to it; it describes the socio-economic conditions of women in a country. Low values on the PC1 axis indicate good conditions for women and high values the opposite. High values on PC2 imply that a large portion of females participate in the labor force and are represented in parliament.
For example, the Nordic countries in the top-right corner of the plot have low values on the PC1 axis and high values on the PC2 axis, indicating good socio-economic conditions and a high participation rate of females in the labor force and parliament. Countries such as Sudan, Mauritania and Afghanistan are the opposite: they have high PC1 values and low PC2 values, indicating worse conditions for females compared to the Nordic countries.
300 consumers were interviewed for the data, which contains 300 observations and 36 variables. Consumers answered questions about their tea consumption preferences, where they buy tea and when they consume it, along with background questions, e.g. sex, age and socio-economic status. The data comes with the FactoMineR package for R. More information can be found here.
I selected columns “Tea”, “How”, “how”, “sugar”, “where” and “lunch” for the analysis.
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.4
library(ggplot2)
library(tidyr)
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.4
data(tea)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
Most of the tea is consumed using a tea bag, usually without milk or lemon. The most popular tea type is Earl Grey, followed by black tea; only a few people drink green tea. About half add sugar to their tea. The tea is usually bought from a chain store.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
For dimension 1, the v.test value is below 1.96 in absolute value for the categories alone and other, indicating that these categories are not significantly different from zero. For dimension 2, the v.test of black tea is below 1.96 in absolute value, so on that dimension black is not significantly different from zero.
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
The distance between variable categories gives a measure of their similarity. Dim 1 explains 15.24 % of the variation in the variables, and Dim 2 explains 14.23 %. These dimensions do not explain much of the variation, and each explains almost as much as the other.
Tea shops and unpackaged tea are closely related, as are tea bags and chain stores. The combination of tea bags and unpackaged tea is related to chain stores. Green tea differs from the other categories. Milk and sugar seem to be added often to Earl Grey, while black tea is often consumed without sugar.
plotellipses(mca, keepvar=(1:5))
The confidence ellipses are not large and are separated from each other, so the different categories form distinct populations on the plot.